Evaluation of an optimized germline exomes pipeline using BWA-MEM2 and Dragen-GATK tools

The next-generation sequencing (NGS) technology represents a significant advance in genomics and medical diagnosis. Nevertheless, the time it takes to perform sequencing, data analysis, and variant interpretation is a bottleneck in using next-generation sequencing in precision medicine. For accurate and efficient performance in clinical diagnostic lab practice, a consistent data analysis pipeline is necessary to avoid false variant calls and achieve optimum accuracy. This study aims to compare the performance of two NGS data analysis pipeline compartments, including short-read mapping (BWA-MEM and BWA-MEM2) and variant calling (GATK-HaplotypeCaller and DRAGEN-GATK). On Whole Exome Sequencing (WES) data, computational performance was assessed using several criteria, including mapping efficiency, variant calling performance, false positive calls rate, and time. We examined four gold-standard WES data sets: Ashkenazim father (NA24149), Ashkenazim mother (NA24143), Ashkenazim son (NA24385), and Asian son (NA25631). In addition, eighteen exome samples were analyzed based on different read counts, and coverage was used precisely in the run-time assessment. By using BWA-MEM 2 and Dragen-GATK, this study achieved faster and more accurate detection for SNVs and indels than the standard GATK Best Practices workflow. This systematic comparison will enable the bioinformatics community to develop a more efficient and faster solution for analyzing NGS data.


Introduction
The development of the next generation of high-throughput next-generation sequencing (NGS) platforms and a reliable and consistent method of identifying genetic variants have made it possible to use personal genome information to detect and identify patients' genetic variations and etiological factors of diseases. However, the analysis and interpretation of these large-scale sequencing data continue to be challenging regarding accurate variant detection, including SNVs and indels, and turnaround time efficiency.
The sequence mapping and variant calling techniques available today are diverse and can be incorporated into diverse pipelines. An NGS pipeline typically includes an aligner and a variant caller: the aligner maps sequencing reads to a reference genome, and the variant caller identifies variant sites and genotypes. The performances of different aligners have been studied extensively [1][2][3][4]. A widely used tool for aligning sequencing reads to a reference genome is the Burrows-Wheeler Aligner (BWA) [5]. There are three versions of BWA: BWA-backtrack, BWA-SW, and BWA-MEM. Each has been designed to handle sequence read problems; for example, BWA-backtrack is better for short sequences, BWA-SW is more sensitive to frequent gaps in the reads, while BWA-MEM is preferred for 70bp reads and more. In BWA-MEM2, the newest version of BWA-MEM, the alignment is identical to BWA, but the runtime is 1-3 times faster depending on the use case, dataset, and the machine [6].
There have been many different variants calling algorithms developed over the years [7][8][9]. When calling variants, the key challenge is distinguishing between false positives and true variants (due to contamination, library preparation artifacts, sequencing artifacts, mismapping). Indel errors are especially difficult to detect because they are more likely to occur in tandem repeats. The probability of detecting the error depends on the period and the repeat length. A widely used variant calling algorithm is GATK-HaplotypeCaller (GATK-HC) from Genome Analysis ToolKit GATK [10]. This algorithm uses a Hidden Markov Model (HMM) to correctly model errors as part of the probability calculation, using predetermined functions that do not depend on individual samples.
A new collaboration that brings together the GATK team and the DRAGEN team (an Illumina company) to co-develop analysis methods and pipelines provides enhanced accuracy in the open-source software version (labeled DRAGEN-GATK). By adding sample-specific logic before variant calling and explicitly considering STR period and length within a new technique called DragSTR, DRAGEN could model the indel error process more accurately. This step, known as sample-specific logic, is part of the auto-calibration step used to find the probability of indel errors and variants based on the BAM input. Their testing confirmed better detection performance in terms of sensitivity and precision: Previously missed calls were recovered, and false positives were removed.
It is hence crucial to evaluate variant calling pipelines that combine aligners and variant callers with the optimal combination that can yield accurate variant calls including SNVs and indels. Our aim in this study was to evaluate the performance of the optimized BWA-MEM2 alignment tool and the optimized Dragen-GATK variant caller tool for detecting SNVs and indels separately. In order to accomplish this objective, we used human whole exome sequencing data from Genome in a Bottle (GiaB) high-confidence GRCh37 reference sets from Ashkenazim father (NA24149), Ashkenazim mother (NA24143), Ashkenazim son (NA24385), and Asian son (NA24631). The time performance comparison was also conducted using eighteen exome samples collected from different read counts and coverage levels since the four reference samples from GiaB were approximate of the same size in yields and read counts perspective. All the analyses were performed on the supercomputer Aziz. Several performance statistics are reported here with respect to execution time, recall, and precision.

Datasets
As a gold standard practice to validate the whole exome sequencing pipelines performance, The highly confident variant call-sets in the Variant Call Format (VCF) of the Chinese son (HG005_NA24631), and Ashkenazi Jewish trio sets consist of the mother (HG004_NA24143), father (HG003_NA24149) and son (HG002_NA24385) using hg19 coordinates, were downloaded from the Genome in a Bottle (GIAB) website (https://ftp-trace.ncbi.nlm.nih.gov/giab/ ftp/release/).
In addition, we collected real data from eighteen patients who were sequenced at Genati labs (which is a College of American Pathologists accredited laboratory) in the Center of Excellence in Genomic Medicine Research (CEGMR) at King Abdulaziz University. This study was approved by IRB no. 32-CEGMR-Bioeth-2021. Since the four reference samples from GiaB were of the same size in yields and read counts perspective, we selected the eighteen exome sequencing files to be of different yields, coverage and mapped reads to evaluate the pipelines' time performance. To facilitate the time performance comparison, these samples were distributed into four groups based on fastq data yield. Six samples were of the smallest amount yield (2-3 GB), eight samples of 4-6 GB yield, 3 samples of 8-11 GB yield, and one sample with the largest yield of 17 GB. Table 1, provides detailed information on these eighteen data sets grouped based on fastq data yield, and passing filter (PF) clusters for both lane 1 and lane 2.

Exome sequencing
All samples in the study were sequenced using Illumina Novaseq 6000 sequencing platform to produce 2x101 paired-end reads. The Illumina DNA Prep with Enrichment library was used to capture the sequences of 214,126 targeted exonic regions. The start and stop of chromosome locations in GRCH37/hg19 were downloaded from the Illumina website: (https://support. illumina.com/downloads/nextera-dna-exome-product-files.html). The GIAB and the 18 fastq files contained more than 94% bases with quality that is higher than Q30.

WES bioinformatics pipelines
The first pipeline used in the comparison was the GATK Best Practices workflow (https:// software.broadinstitute.org/gatk/best-practices) for analyzing NGS data [11]. We started by trimming the raw paired-end reads represented in FASTQ files by removing the low-quality reads and adapter sequences with trimmomatic (version 0.36). Then, FastqQC (version 3) was applied to evaluate the quality of the fastq data obtained. The raw paired-end reads were aligned with GRCH37 reference genome using BWA-mem aligner (version 0.7.12), to generate BAM files, which were then sorted by genome position using samtools (version 1.2), then marked for duplicates using picard-tools (version 2.2.1). The raw BAM files were refined by BQSR using default parameters for GATK (version 4.1.2). The next step is to perform variant calling (SNVs and indels) using the HaplotypeCaller module and filter the VCFs based on (allele frequency, read depth, mapping quality, and variant quality score). A report with mapping efficiency, variant calling performance, false positive call rate, and time is generated at the end. An optimized pipeline has been proposed as the second pipeline. This pipeline follows the same workflow structure that is described in the GATK workflow best practices. However, a recently upgraded and improved aligner (BWA-MEM2) and variant caller (Dragen-GATK) were utilized. The optimized pipeline starts with a trimming adapter, and low-quality reads using trimmomatic (version 0.36) and FASTQC (Version 3) are used to assess raw data quality. Using BWA-MEM2 (version 2.2.1), the reads were aligned to GRCH37. Postalignment processing was performed with samtools (version 1.2) to transform sam files into bams, sorting and creating indexes on bams, and with GATK (version 4.2.2) for marking duplicates, base recalibration, and variant calling. The pipeline ended with vcf filtration. We also utilized the spark-enabled tools available in GATK version 4 to optimize performance and CPU utilization [12]. The two pipelines are used in the comparison study as described in Fig 1.

Computational resources
Both pipelines were run using a job array on King Abdulaziz University's High-Performance Computing Center (http://hpc.kau.edu.sa). AZIZ is equipped with Fujitsu PRIMERGY CX400, Intel True Scale QDR, and Intel Xeon E5-2695v2 12C 2.4GHz processors with 256 GB RAM. Each pipeline had the same number of nodes, CPUs, and samples in the job arrays. After that, the mapping efficiency was compared using the quality control reports generated for each sample from both of the pipelines. The metrics used for the comparison are: total targeted reads, total targeted bases, and mean target coverage.

Pipeline comparison metrics
In a final step, the run time of both pipelines was measured. To better assess the run time efficiency, the run time evaluation procedure was divided into three phases: (1) an overall run time performance from the beginning to the end of the pipeline,(2) upstream analysis (Fastq to BAM file), and (3) downstream analysis (BAM to VCF file).

PLOS ONE
Evaluation of an optimized germline exomes pipeline

Variant calling and false positive calls performance assessment
The pipelines were evaluated by calculating true positive (TP), false positive (FP), false negative (FN) variants, precision, and recall using VCAT v4.0.4. Results are presented in Table 2 for SNVs and indels. As for SNVs variant calls, both pipelines demonstrated good performance, with accuracy and recall exceeding 97% and 98%, respectively. The BWA-MEM 2+Dragen-GATK showed higher performance in the sample NA24149 where recall grew from 97% to 99% as a result of true positives and false negatives increasing by 9% and decreasing by 5%, while its performance was slightly improved in the other samples.
Due to the challenging nature of indel calls, BWA-MEM 2+DRAGEN-GATK achieves better performance than GATK best practice workflow in terms of term recall, which was increased by a minimal 1% since the difference in the number of true positives was small between the callers. The number of false positives varied greatly between BWA-MEM 2+DRAGEN-GATK and GATK best practice workflow (Fig 2), which contributed to an increase in indels precision to 95%.
As the BWA-MEM 2+Dragen-GATK pipeline generated reliable results, there is a need to ensure that the results are not just reliable but even reproducible with every run analysis. For that, we assessed the reproducibility of the variants (SNVs/indels) called by the BWA-MEM 2 +Dragen-GATK pipeline using the GIaB samples in four replicated runs. The performance results of the four replicated results for each sample were identical Table 3. Furthermore, we assessed deeply the variant detection of the four replicates. We observed high concordance of the variants among replicated runs (99.9%) while the error rate was (0.019%),(0.002%), (0.09%), (0.07%) in the NA24149, NA24143, NA24385,and NA24631 samples (Fig 3).

Run time performance assessment
This study evaluates the run-time performance of BWA-MEM + Dragen-GATK against the GATK Best Practice pipeline. A total of 18 short-read datasets sampled from different coverage and read counts were analyzed to determine the impact on computation time. In both pipelines, the computation time correlated linearly with the sample read count. Each sample run under each pipeline 20 times, and the average of the time performance was calculated as shown in Table 4. Overall, the BWA-MEM2+DRAGEN-GATK pipeline was much more efficient compared to the standard GATK pipeline, producing a half-time reduction in total analysis time for all samples of different yields (Fig 4).
To further assess the run time differences between the two pipelines, we compared the run time at different phases of pipeline 1. Aligners run time (i.e.time required to generate sorted, marked duplicate Bam file from the fastq file), and 2. variant callers run time (the time required to generate the original and filtered VCF file from the BAM file). Fig 5A shows the  time performance of BWA-MEM2 with the advantage of being more than 2X faster than its precedence, BWA-MEM. To further evaluate mapping results from the two tools and if any discrepancies could impact downstream genetics variants detection, three metrics (total targeted reads, total targeted bases, and mean target coverage) were used to compare the mapping results from the two tools and equivalent mapping results as were achieved (Fig 5B).

Discussion
One significant challenge when performing whole exome sequencing (WES) is how to process the data to detect the exact genetic variants that cause the disease. Alignment and variant calling tools are essential to this process. In this study, we evaluated the performance of three short-read sequence alignment algorithms (BWA-MEM, BWA-MEM2) and variant calling algorithms (GATK-HC, DRAGEN-GATK) on whole exome sequencing data. Evaluation criteria included mapping efficiency, variant calling performance, false positive calls, and time  performance. The performance assessment used a gold-standard set of exomes taken from four Ashkenazim fathers (NA24149), Ashkenazim mothers (NA24143), Ashkenazim sons (NA24385), and Asian sons (NA24631), along with real-world data collected from different read counts and coverage. Samples were sequenced on a Novaseq 6000 sequencing platform (Illumina) with 2 * 101 cycles reads mapped to the hg19 reference human genome assembly (GRCh37). The BWA-MEM2+DRAGEN-GATK pipeline processed germline mutation calling much more efficiently than the standard BWA+GATK pipeline, where the overall analysis time for different samples with different read counts and fastq file sizes decreased to half. Additionally, BWA-MEM2+DRAGEN-GATK is highly sensitive and precise in calling SNVs and indels. Overall, the focus of this study was to demonstrate the performance of the optimized version of the BWA-MEM2 aligner and the DRAGEN-GATK variant caller tools in detecting SNVs and indels compared to the modified versions of BWA-MEM2.

Conclusion
We present a comprehensive evaluation between BWA-MEM 2+DRAGEN-GATK and GATK best practice workflows. All the code used for implementing the two pipelines is available on GitHub https://github.com/Abusamra/GATK-best-practice-vs-optimized-Pipeline-for-Improved-Detection-of-Germline-Exomes-. Based on the results and data analysis, we believe that using the BWA-MEM2 and the DRAGEN-GATK pipeline can be more accurate and efficient for the WES analysis.